Projects
Crash Clarity: Data-Driven Insights for Enhancing UK Road Safety using statistical models

ABSTRACT
Road accidents and safety remain critical public health concerns worldwide, with significant societal, economic, and emotional impacts. In the United Kingdom, the government provides comprehensive data on road accidents through its Road Accident and Safety Statistics guidance. This academic project leverages these statistics to analyze and interpret the trends, patterns, and contributing factors associated with road accidents in the UK.
The study explores key variables such as accident severity, weather and road conditions, time of day, and demographic factors, providing actionable insights into the circumstances under which accidents are most likely to occur. Utilizing advanced data visualization techniques, including interactive heatmaps and histograms, the project presents complex information in a clear and engaging manner to enhance understanding and foster data-driven decision-making.
The findings emphasize the critical role of environmental and behavioral factors in road safety and aim to support policymakers, researchers, and road users in designing effective interventions to reduce accidents and improve safety measures. This project underscores the importance of leveraging statistical data to promote evidence-based strategies for safer transportation systems.
Visualizing Accident Severity Distribution
This interactive histogram presents a comprehensive analysis of road accident severity levels categorized as Life-Threatening, Significant, and Mild. Each category is visually distinguished using a specific color palette, with red denoting Life-Threatening accidents, orange representing Significant accidents, and green for Mild cases. The visualization provides an intuitive understanding of the frequency distribution of these severity levels, enabling researchers and policymakers to identify patterns and focus on mitigating the most critical accident types. The interactivity of the graph allows for an in-depth examination of accident counts, enhancing data-driven decision-making and supporting evidence-based road safety interventions.
Temporal Analysis of Road Accidents by Time Bands
This interactive visualization categorizes road accidents into five time bands: “Night (Midnight to 5 AM),” “Morning Rush Hour,” “Daytime,” “Evening Rush Hour,” and “Night (8 PM to 11 PM)” using STATS20 guidance. The bar plot highlights accident frequencies with a gradient color scheme, showing the highest occurrences during “Daytime” and “Evening Rush Hour”.
These insights help identify high-risk periods, enabling policymakers and researchers to develop targeted road safety strategies. The interactive design allows for detailed exploration of accident patterns.
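The time-band grouping described above can be sketched as a small lookup function. This is a minimal illustration in Python rather than the project's own R code; only the two night bands (midnight to 5 AM, 8 PM to 11 PM) come from the text, while the cut-offs between the morning rush, daytime, and evening rush bands are assumptions, and the accident hours are hypothetical.

```python
from collections import Counter


def time_band(hour: int) -> str:
    """Map an hour of day (0-23) to one of the five time bands."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    if hour < 5:
        return "Night (Midnight to 5 AM)"
    if hour < 10:  # assumed cut-off, not stated in the write-up
        return "Morning Rush Hour"
    if hour < 16:  # assumed cut-off, not stated in the write-up
        return "Daytime"
    if hour < 20:  # assumed; chosen so the last band starts at 8 PM
        return "Evening Rush Hour"
    return "Night (8 PM to 11 PM)"


# Hypothetical accident hours, for illustration only.
accident_hours = [1, 8, 9, 12, 13, 14, 17, 18, 18, 22]
band_counts = Counter(time_band(h) for h in accident_hours)
print(band_counts.most_common())
```

The resulting counts per band are what a bar plot of accident frequency by time band would display.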
Accidents by Weather and Light Conditions
This interactive heatmap analyzes the influence of weather and light conditions on road accidents, highlighting combinations like “Fine without high winds” and “Daylight” with the highest frequencies. A gradient color scale emphasizes accident intensity, with data labels providing exact counts. The visualization aids in identifying high-risk conditions to inform targeted safety measures.
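The counts behind a heatmap of this kind are a cross-tabulation of two categorical columns. The sketch below is in Python rather than the project's R, and the records are hypothetical; it only illustrates how the weather-by-light count matrix could be assembled before plotting.

```python
from collections import Counter

# Hypothetical (weather, light) pairs, one per accident record.
records = [
    ("Fine without high winds", "Daylight"),
    ("Fine without high winds", "Daylight"),
    ("Fine without high winds", "Daylight"),
    ("Fine without high winds", "Darkness - lights lit"),
    ("Raining without high winds", "Daylight"),
    ("Raining without high winds", "Darkness - lights lit"),
    ("Raining without high winds", "Darkness - lights lit"),
]

pair_counts = Counter(records)
weather_levels = sorted({weather for weather, _ in records})
light_levels = sorted({light for _, light in records})

# Rows are weather conditions, columns are light conditions.
matrix = [[pair_counts[(w, l)] for l in light_levels] for w in weather_levels]

for weather, row in zip(weather_levels, matrix):
    print(weather, dict(zip(light_levels, row)))
```

Each cell of `matrix` corresponds to one tile of the heatmap, and the cell value is the data label shown on that tile.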
Impact of Weather Conditions on Road Accidents
This interactive bar plot presents the distribution of road accidents under various weather conditions, highlighting categories such as Fine without high winds, Raining without high winds, and Fog or mist. The gradient color scale, ranging from light pink to deep red, emphasizes the frequency of accidents, with higher counts visually more prominent. Tooltips provide precise accident counts for each weather condition, enhancing the interpretability of the data.
The visualization reveals that the majority of accidents occur under Fine without high winds, suggesting that favorable weather does not necessarily mitigate risk. Such insights are critical for policymakers and researchers to understand environmental influences on road safety and to develop targeted prevention strategies.
Accidents by Road Surface Conditions
This interactive bar plot examines the distribution of road accidents across various surface conditions based on STATS19 classifications, such as Dry, Wet or damp, and Snow. Each condition is color-coded for clarity, with tooltips providing detailed accident counts for enhanced interpretability.
The analysis reveals that the majority of accidents occur on Dry surfaces, followed by Wet or damp conditions, while adverse surfaces like Flood and Mud show significantly lower frequencies. These findings emphasize the need to consider surface conditions when implementing road safety measures, particularly for common scenarios like wet or dry roads. The visualization supports data-driven strategies for reducing accidents under diverse environmental conditions.
Random Forest Model: What Combination of Factors Most Strongly Predicts Accident Severity?
A Random Forest model was developed to classify accident severity based on features such as weather conditions, road surface conditions, number of vehicles, and urban or rural location. The model, an ensemble of 100 decision trees, achieved robust classification with reasonable AUC values across all severity levels, as visualized in the ROC curves for each class.
The feature importance plot highlights “Weather Conditions” and “Road Surface Conditions” as the most significant predictors of accident severity, followed by “Number of Vehicles” and “Urban or Rural Area.” These insights provide valuable guidance for prioritizing interventions and refining predictive models to improve road safety outcomes. The analysis underscores the importance of environmental and contextual factors in accident severity classification.




Multinomial Logistic Regression Analysis of Accident Severity by Weather Conditions
A multinomial logistic regression model was developed to examine the relationship between accident severity and weather conditions using a cleaned subset of the dataset. The dataset was partitioned into training (80%) and testing (20%) subsets to ensure robust evaluation. The optimizer converged shortly after iteration 10, with residual deviance and AIC values of 100785 and 100793, respectively.
The confusion matrix revealed that the model performed well in classifying higher severity levels, achieving an overall accuracy of 76.06%. The coefficients indicate a positive association between weather conditions and accident severity, suggesting that as adverse weather conditions increase, the likelihood of severe accidents also rises. These findings underscore the critical role of weather in road safety and provide insights for preventive measures.
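Overall accuracy is the share of correctly classified cases, that is, the diagonal of the confusion matrix divided by its total. When classes are imbalanced, per-class recall is a useful companion figure. The sketch below is in Python rather than the project's R, and the confusion matrix is hypothetical; it is not the matrix produced by the model above.

```python
classes = ["Fatal", "Serious", "Slight"]

# Hypothetical confusion matrix: rows are actual classes, columns are predicted.
confusion = [
    [5, 10, 5],     # actual Fatal
    [3, 150, 67],   # actual Serious
    [2, 48, 710],   # actual Slight
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(classes)))
accuracy = correct / total

# Recall per class: correctly predicted cases divided by actual cases of that class.
recall = {
    cls: confusion[i][i] / sum(confusion[i])
    for i, cls in enumerate(classes)
}

print(f"accuracy = {accuracy:.3f}")
for cls, value in recall.items():
    print(f"recall[{cls}] = {value:.3f}")
```

In this made-up example the overall accuracy is high because the largest class is classified well, even though recall for the smallest class is much lower.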
ROC Curve Analysis for Multinomial Logistic Regression Model
The ROC curve illustrates the predictive performance of the multinomial logistic regression model in classifying accident severity levels (Fatal, Serious, and Slight) based on weather conditions. One-vs-all ROC curves were generated for each class, with distinct color coding: red for Fatal, blue for Serious, and green for Slight.
The curves largely overlap with the diagonal reference line, indicating limited separation between true positive and false positive rates. Because a ROC curve that bows further away from the diagonal, toward the top-left corner, signifies better model performance, these results suggest the need for further feature refinement or model optimization to improve classification accuracy. The AUC values, though reasonable, highlight areas for potential enhancement in predictive capability.
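A one-vs-all AUC can be computed directly from predicted class probabilities: for each class, it is the probability that a randomly chosen case of that class receives a higher score than a randomly chosen case of any other class. The following is a minimal Python sketch of that idea, not the R code used in the project; the probabilities and labels are hypothetical.

```python
def auc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def one_vs_rest_auc(prob_rows, true_classes, classes):
    """Compute one AUC per class, treating that class as positive and the rest as negative."""
    result = {}
    for k, cls in enumerate(classes):
        scores = [row[k] for row in prob_rows]
        labels = [1 if t == cls else 0 for t in true_classes]
        result[cls] = auc(scores, labels)
    return result


# Hypothetical predicted probabilities (columns: Fatal, Serious, Slight).
classes = ["Fatal", "Serious", "Slight"]
prob_rows = [
    [0.10, 0.30, 0.60],
    [0.05, 0.25, 0.70],
    [0.20, 0.50, 0.30],
    [0.60, 0.30, 0.10],
    [0.02, 0.18, 0.80],
]
true_classes = ["Slight", "Slight", "Serious", "Fatal", "Slight"]
print(one_vs_rest_auc(prob_rows, true_classes, classes))

# A model that gives every case the same score cannot separate the classes.
uninformative = auc([0.3, 0.3, 0.3, 0.3], [1, 0, 1, 0])
print(uninformative)
```

An AUC of 0.5 corresponds to a ROC curve lying on the diagonal reference line, which is why curves that overlap the diagonal indicate little discriminatory power.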
# weights: 9 (4 variable)
initial value 91633.053773
iter 10 value 50406.679444
final value 50392.493956
converged
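The residual deviance and AIC reported for the multinomial model can be reproduced from this training log. The sketch below, written in Python as a worked calculation, assumes the logged "value" is the negative log-likelihood and that "(4 variable)" is the number of estimated parameters, which is how this style of trace is normally read.

```python
# Figures copied from the training log above.
final_value = 50392.493956  # "final value": assumed to be the negative log-likelihood
n_params = 4                # "(4 variable)": number of free parameters

# Deviance is twice the negative log-likelihood; AIC adds 2 per parameter.
residual_deviance = 2 * final_value
aic = residual_deviance + 2 * n_params

print(round(residual_deviance))  # 100785
print(round(aic))                # 100793
```

Both rounded values agree with the residual deviance (100785) and AIC (100793) quoted for the model.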


Logistic Regression for Binary Accident Severity Classification by Urban or Rural Area and Weather Conditions
A logistic regression model was employed to classify accident severity into binary categories: “Slight” (1) and “Fatal/Serious” (0), using features such as urban or rural area and weather conditions. The model showed only a modest reduction in deviance (from a null deviance of 91810 to a residual deviance of 91030), with an AIC of 91040, suggesting limited improvement over the null model.
The ROC curve yielded an AUC value of 0.5498, indicating the model’s predictive performance is slightly better than random chance. The negative coefficient for “urban_or_rural_area” suggests that accidents in urban areas are more likely to be classified as “Slight,” while the positive coefficient for “weather_conditions” implies a weak association with increased severity. Overall, the model demonstrates minimal predictive capability and requires additional features or refinement to achieve better classification accuracy and practical applicability.
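To put the deviance figures for this model in perspective, the reduction can be expressed as a share of the starting deviance. This is a small worked calculation in Python, assuming 91810 is the null deviance and 91030 the residual deviance.

```python
null_deviance = 91810      # deviance of the intercept-only model (assumed reading)
residual_deviance = 91030  # deviance of the fitted model

reduction = null_deviance - residual_deviance
share_explained = reduction / null_deviance

print(reduction)                         # 780
print(f"{100 * share_explained:.2f}%")   # 0.85%
```

A reduction of under 1% of the null deviance is consistent with an AUC only slightly above 0.5.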

Lasso Logistic Regression for Binary Classification of Accident Severity
A Lasso logistic regression model was applied to classify accident severity into binary categories (“Slight” vs. “Fatal/Serious”) using features such as urban or rural area and weather conditions. The model employed cross-validation to identify the optimal regularization parameter (lambda.min), ensuring reduced overfitting and improved generalizability.
The ROC curve yielded an AUC of 0.5514, indicating a marginally better performance than random guessing. The close proximity of the ROC curve to the diagonal reference line suggests limited predictive power. While the model effectively reduces feature complexity, the low AUC highlights the need for additional predictive variables or refined feature engineering to improve classification accuracy and ensure practical applicability.
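The way a Lasso penalty reduces feature complexity can be illustrated with the soft-thresholding operator, which shrinks every coefficient toward zero by the penalty strength and sets small ones exactly to zero. This is a simplified Python sketch of the mechanism only, not the fitting routine used in the project; the coefficient values and the penalty are hypothetical.

```python
def soft_threshold(z: float, lam: float) -> float:
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalized coefficients and penalty strength.
raw_coefficients = {
    "urban_or_rural_area": 0.80,
    "weather_conditions": -0.05,
    "light_conditions": 0.03,
    "road_surface_conditions": -0.40,
}
lam = 0.10

shrunk = {name: soft_threshold(b, lam) for name, b in raw_coefficients.items()}
selected = [name for name, b in shrunk.items() if b != 0.0]
print(shrunk)
print(selected)
```

A larger penalty removes more features; cross-validation is a common way to choose the penalty value that gives the best out-of-sample fit.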

Lasso Logistic Regression for Predicting Accident Severity including more variables
A Lasso logistic regression model was implemented to classify accident severity into binary categories: “Severe” (1) and “Slight” (0). The model utilized key features such as road surface conditions, weather conditions, urban or rural area, and time of day. Cross-validation identified the optimal regularization parameter (lambda.min), which performs feature selection and helps prevent overfitting.
The model achieved an AUC of 0.612, which is the highest among all models evaluated, as visualized through the ROC curve. This indicates improved predictive performance and better discriminatory power compared to earlier approaches. Feature coefficients highlighted the importance of road surface conditions and urban/rural areas as significant predictors. While the model demonstrates improved performance, further enhancements could refine its applicability for real-world scenarios.


Re-visualization Project Using R
Introduction:
Suicide is the act of intentionally causing one’s own death. It can arise from many conditions and circumstances; risk factors include mental disorders, physical disorders, and substance abuse. Suicides resulted in 828,000 deaths globally in 2015, an increase from 712,000 deaths in 1990. This makes suicide the 10th leading cause of death worldwide. Every death from suicide is a tragedy.
Below is the visualization on suicides by Saloni Dattani, Lucas Rodes-Guirao, Hannah Ritchie, Max Roser, and Esteban Ortiz-Ospina. Their research shows that suicide rates can be reduced with greater understanding and support. To that end, the researchers treat suicide as a public health problem and stress that it can be prevented and that its rates can be reduced.
Please watch the full project presentation on the following YouTube video
OLD VISUALIZATION:
Suicide rates vary around the world:
Suicide rates vary widely between countries. The given visualization depicts annual suicide rates per 100,000 people from 1950 to 2022 across various countries. The researchers used a line graph to present the data.
The x-axis represents the years from 1950 to 2022, and the y-axis represents the suicide rate per 100,000 people, ranging from 0 to 40. A higher value on the y-axis indicates a higher suicide rate.
Each line of the graph represents a country. Countries with higher suicide rates appear toward the top. The legend lists the countries.
Observations:
There is wide variation between countries. Countries such as Lithuania and South Korea show the highest suicide rates, as indicated by their position near the top of the graph.
Some countries show large fluctuations in suicide rates, while others show a roughly constant rate throughout the years.
The source also notes that suicide deaths are under-reported in many countries because of social stigma and cultural or legal concerns, which means that actual rates can be higher than the reported rates.
The data is based on the causes of death listed on death certificates, which can affect its accuracy.
The data is age-standardized, allowing a fair comparison between countries with different age structures and ensuring that population age distribution doesn’t skew the data.
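Age standardization can be illustrated with a small example of direct standardization: each country's age-specific rates are weighted by the same reference age structure instead of its own. The sketch below is in Python, and every number in it is hypothetical; the actual standard population used by the original researchers is not stated here.

```python
def weighted_rate(age_rates, weights):
    """Weighted average of age-specific rates (per 100,000) using the given age weights."""
    total_weight = sum(weights.values())
    return sum(age_rates[group] * w for group, w in weights.items()) / total_weight


# Hypothetical: two countries with identical age-specific rates per 100,000.
age_rates = {"15-39": 8.0, "40-64": 12.0, "65+": 20.0}

# Hypothetical population shares (in %) for a younger and an older country.
young_country = {"15-39": 60, "40-64": 30, "65+": 10}
old_country = {"15-39": 30, "40-64": 40, "65+": 30}

# Hypothetical reference age structure shared by both countries.
standard_population = {"15-39": 50, "40-64": 35, "65+": 15}

crude_young = weighted_rate(age_rates, young_country)
crude_old = weighted_rate(age_rates, old_country)
standardized = weighted_rate(age_rates, standard_population)

print(crude_young, crude_old, standardized)
```

The crude rates differ only because the older country has more people in the high-rate age group; with a shared reference structure, both countries receive the same standardized rate.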
Problems with the Visualization:
Too many lines: The graph contains a very large number of lines, one per country. This makes the graph cluttered, and it is very difficult to read the data at a glance.
Color categorization: Each country is drawn in a different color, but several countries have very similar colors, which makes them hard to tell apart. We can use more contrasting colors, or group the colors by region or category.
Interactive labeling: With so many lines, and with every country listed in the legend, it is nearly impossible to pick out a particular country quickly. Interactive labeling can be used to highlight a specific country.
No highlights on the key insights: All the lines have the same width, so nothing differentiates one country from another. We can highlight the countries with the highest and lowest suicide rates by using different line widths.
The above visualization tells us about the reported suicide rates by age in the United States.
Observations:
It breaks down suicide rates by age group, from children through older adults. The data highlights trends such as increasing or decreasing suicide risk within specific age groups over time and across different regions.
It shows suicide rates per 100,000 people across different age groups. Age-specific data usually reveals which age groups are more vulnerable to suicide in different regions.
According to the graph, older people (ages 80-84) have the highest suicide rates, while rates for ages 15-19 are lower. However, there is growing concern about suicide rates among young adults, particularly due to health conditions and mental stress.
Problems with the Visualization:
Too many lines: The graph contains a large number of lines, one per age group. This makes the graph cluttered, and it is very difficult to read the data at a glance.
Color categorization: Each age group is drawn in a different color, but several age groups have very similar colors, which makes them hard to tell apart. We can use more contrasting colors, or group the colors into broader age categories.
Legend: The legend has too many entries, making it difficult to find the data for a particular age group in a particular year. Viewers must constantly shift their focus between the legend and the graph, which makes it hard to read off exact information.
Lack of data insights: There is no contextual information or annotation on the graph to explain significant spikes, trends, or sudden drops in the suicide rates for certain age groups.
Interactive labeling: Adding interactive labeling would improve readability.
Re-Visualizations:
Based on the observations and visualization problems identified above, we made some changes and re-visualized the data as shown below:
Each map or graph in this project displays suicide rates per 100,000 people to enhance the clarity and effectiveness of the visualization; this is the standard that data analysts generally follow when visualizing death-related data.
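A rate per 100,000 people is the number of deaths divided by the population, multiplied by 100,000. The following Python lines are a minimal illustration with hypothetical figures.

```python
def rate_per_100k(deaths: int, population: int) -> float:
    """Deaths per 100,000 people."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths / population * 100_000


# Hypothetical figures: 150 deaths in a population of 1.25 million.
example_rate = rate_per_100k(150, 1_250_000)
print(example_rate)
```

Expressing deaths as a rate rather than a raw count allows populations of very different sizes to be compared on the same scale.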
Average Suicidal Rates By Country from 1950 to 2022:
The map below illustrates the average suicide rates by country from 1950 to 2022, broken down by different age groups such as children, young adults, and adults across all nations.
In the previous visualization, the data was presented in a line graph for all countries, resulting in a cluttered and hard-to-read display. To improve clarity, we re-visualized the data by focusing on the average suicide rates from 1950 to 2022 using a world map. From the provided dataset, we calculated each country’s average suicide rate over the years and plotted those averages. The world map offers an easier and more intuitive way to interpret the data. This updated map is also interactive, allowing users to highlight specific parameters and explore the average range of deaths by suicide more flexibly.
Based on the averages shown in the map, Russia has the highest average suicide rate.
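The per-country averaging behind this map can be sketched as a group-and-mean step. The example below is in Python rather than the project's R, and the rows use placeholder country names and hypothetical rates.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (country, year, suicide rate per 100,000).
rows = [
    ("Country A", 1990, 30.0),
    ("Country A", 2000, 26.0),
    ("Country B", 1990, 10.0),
    ("Country B", 2000, 14.0),
    ("Country C", 2000, 5.0),
]

rates_by_country = defaultdict(list)
for country, _year, rate in rows:
    rates_by_country[country].append(rate)

# One average per country: this is the value each country is colored by on the map.
average_by_country = {c: mean(r) for c, r in rates_by_country.items()}
highest = max(average_by_country, key=average_by_country.get)

print(average_by_country)
print(highest)
```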
Average Suicide Rates of All Ages by Country for the Peak Year: 1982
The map below shows the average suicide rates for all age groups by country for the year 1982, which was chosen because it had the highest suicide rates between 1950 and 2022.
In the previous visualization, the data was displayed in a line graph, resulting in a cluttered and hard-to-read format. To improve accessibility, we re-visualized the data by calculating the average suicide rate for each year and selecting 1982, the year in which the rate peaked.
This updated visualization uses a world map to display the data, with countries categorized by different colors, highlighting the highest suicide rates in red. Russia stands out as having the highest average suicide rates.
Average Suicide Rates by Year all over the World:
The graph below shows the global average suicide rate by year from 1950 to 2022. We re-visualized the data by calculating the average rates over this period: first, we determined the average across all countries for each year, and then we used a frequency polygon to visualize the trend in suicide rates over the years.
In this representation, the highest suicide rate occurred in 1982, with a rate of 12.38, while the lowest rate across all countries and age groups was recorded in 2016 with 7.21.
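Finding the peak and lowest years amounts to averaging across countries within each year and then taking the maximum and minimum of those yearly averages. This is a minimal Python sketch with placeholder countries and hypothetical rates, not the project's R code or data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (country, year, suicide rate per 100,000).
rows = [
    ("Country A", 2000, 11.0),
    ("Country B", 2000, 9.0),
    ("Country A", 2001, 14.0),
    ("Country B", 2001, 10.0),
    ("Country A", 2002, 8.0),
    ("Country B", 2002, 6.0),
]

rates_by_year = defaultdict(list)
for _country, year, rate in rows:
    rates_by_year[year].append(rate)

# Average across all countries for each year, in chronological order.
yearly_average = {year: mean(rates) for year, rates in sorted(rates_by_year.items())}

peak_year = max(yearly_average, key=yearly_average.get)
lowest_year = min(yearly_average, key=yearly_average.get)

print(yearly_average)
print(peak_year, lowest_year)
```

The `yearly_average` series is what the trend line plots, with one point per year.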
Top 5 Years with Highest Suicide Rates in Top 5 Countries:
The graph below shows the top 5 years with the highest suicide rates in the top 5 countries. First, we identified the 5 countries with the highest average suicide rates. After filtering the data to include only these countries, we selected the top 5 years with the highest suicide rates for each. Using this categorized data, we created a bar graph with ggplot. Additionally, we added interactive labeling to enhance accessibility and provide a more user-friendly experience for viewers.
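The two-stage selection described above (rank countries by average rate, then rank years within each selected country) can be sketched as follows. The function name and the data are hypothetical, and the example is in Python rather than the R and ggplot code used in the project.

```python
from collections import defaultdict
from statistics import mean


def top_years_in_top_countries(rows, n_countries=5, n_years=5):
    """Return, for the n_countries with the highest average rate,
    their n_years highest (year, rate) pairs in descending order of rate."""
    by_country = defaultdict(list)
    for country, year, rate in rows:
        by_country[country].append((year, rate))

    # Stage 1: rank countries by their average rate over all years.
    ranked = sorted(
        by_country,
        key=lambda c: mean(rate for _, rate in by_country[c]),
        reverse=True,
    )

    # Stage 2: within each selected country, keep the years with the highest rates.
    result = {}
    for country in ranked[:n_countries]:
        best = sorted(by_country[country], key=lambda pair: pair[1], reverse=True)
        result[country] = best[:n_years]
    return result


# Hypothetical rows: (country, year, suicide rate per 100,000).
rows = [
    ("Country A", 2000, 30.0), ("Country A", 2001, 35.0), ("Country A", 2002, 25.0),
    ("Country B", 2000, 20.0), ("Country B", 2001, 15.0), ("Country B", 2002, 22.0),
    ("Country C", 2000, 5.0), ("Country C", 2001, 6.0),
]

top = top_years_in_top_countries(rows, n_countries=2, n_years=2)
print(top)
```

With `n_countries=5` and `n_years=5`, the result would hold the 25 country-year pairs shown in the bar graph.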
Average Suicide Rates for Ages 15-19 over the years:
The graph below displays the average suicide rates for individuals aged 15-19. The previous visualization focused on overall suicide rates across all years and age groups. For this re-visualization, we specifically selected the 15-19 age group, as it marks the end of teen years. We used a world map to represent the data and incorporated interactive labeling for easier interpretation.
Average Suicide Rate for the Top 20 Nations in the 15-19 Age Group (Across All Years):
The graph below illustrates the average suicide rate for the top 20 countries in the 15-19 age group over the years 1950 to 2021, with interactive labeling added for an enhanced user experience.